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data  and  good  for  instrument-rated  data.  Estimates  of  median  TFH  were  derived  for  each  dataset,  which  will  be 
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Predicting  General  Aviation  Accident  Frequency 
From  Pilot  Total  Flight  Hours 


Figure  1 .  Frequency  histogram  of  a)  fatal  accident  count  (y-axis)  for  instrument-rated  GA  pilots  as  a  function  of  total  flight  hours 
(x-axis,  bin  size=100),  b)  the  same  data  with  a  natural  log-transformed  x-axis. 


INTRODUCTION 

Figure  la  shows  an  example  of  one  category  of  data 
we  frequently  see  in  general  aviation  (GA)  accident 
analysis.  This  is  a  histogram  of  fatal  accident  counts  for 
instrument-rated  pilots  of  GA  aircraft^  as  a  function  of 
pilots’  total  flight  hours  (TFH).^The  data  reflect  actual 
U.S.  National  Transportation  Safety  Board  accident  data 
from  1983-2000,  inclusive  (NTSB,  2011). 

It  is  not  hard  to  appreciate  the  usefulness  of  a  modeling 
function  here.  Such  a  function  would  smooth  the  noise 
in  the  data,  allowing  investigators  to  better  predict  how 
many  pilots  of  a  given  experience  level  are  likely  to  be 
involved  in  accidents  over  a  given  time  period.  This  would 
be  useful,  for  instance,  in  allocating  resources  for  pilot 
training,  or  as  the  basis  for  a  statistical  covariate  of  flight 
risk.  Even  a  casual  glance  at  Figure  1  shows  that  policy 
makers  would  want  to  focus  on  pilots  having  fewer  than 
5000  TFH,  simply  because  there  are  far  more  accidents 
in  that  range.  The  question  is  how  to  get  beyond  the 
considerable  noise  in  the  data  to  arrive  at  more  precise 
estimates  of  this  kind. 


^“GA  aircraft”  are  defined  here  as  “all  N- tail-numbered  aircraft 
operating  in  the  U.S.  under  all  Federal  Aviation  Regulations  (FAR) 
Parts  except  121  and  135,  regardless  of  airframe  type  or  weight.” 
^Total  flight  hours  includes  all  flight  time  logged  at  the  controls  of 
all  aircraft,  regardless  of  aircraft  class  or  category. 


METHOD 

Choosing  a  modeling  function 

Ideally,  we  like  modeling  functions  to  be  motivated 
by  theory  about  causal  processes  inherent  to  our  data. 
However,  in  the  case  of  aviation  risk,  we  cannot  expect 
these  processes  to  be  few  or  simple. 

Three  major  processes  influence  the  GA  accident  rate, 
two  of  which  have  their  own  set  of  sub-processes. 

1 .  Processes  that  affect  the  number  of  pilots  at  different 

values  of  TFH. 

a.  GA  flight  is  expensive,  both  in  time  and  money, 
plus,  it  takes  time  to  accumulate  flight  hours.  Hence, 
we  expect  that  many  pilots  will  have  relatively  few 
TFH,  with  ever-diminishing  numbers  of  pilots  as 
TFH  increase. 

b.  Pilots  accumulateTFH  at  different  rates,  which  may, 
themselves  change,  depending  on  the  season  of  the 
year,  employment  situation,  the  pilots  economic 
and  social  circumstances,  and  whim. 

c.  Commercial  pilots  (those  who  fly  as  paid  profes¬ 
sionals)  who  also  fly  GA  aircraft  have  their  com¬ 
mercial  hours  included  in  their  TFH.  U.S.  National 
Transportation  Safety  Board  (NTSB)  data  confirm 
that  commercial  flying  is  statistically  safer  than  GA 
flying.^  So,  for  these  pilots,  high  TFH  does  not 
imply  proportionately  higher  risk. 


^Several  hundred  people  are  regularly  killed  each  year  in  GA,  a  fatal 
accident  rate  about  40  times  greater  than  large  commercial  passenger 
air  carriers  such  as  United,  Delta,  and  American  Airlines  (source: 
www.ntsb.gov). 
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d.  Pilots  constantly  enter  and  leave  the  pilot  popula¬ 
tion  at  independent  rates,  due  to  economic  factors 
and/ or  old  age. 

e.  Pilots  leave  one  data  collection  category  and  enter 
another,  when  obtaining  a  new  category  of  pilot 
license  or  certification  (e.g.,  getting  an  instrument 
rating). 

2.  Processes  that  affect  each  individual  pilot's  flight  risk. 

a.  Pilots  differ  in  innate,  average  skill. 

b.  Flight  risk  tends  to  be  low  when  pilots  are  students, 
to  increase  when  they  first  begin  to  fly  solo,  then  to 
decrease  after  they  gain  experience  (Craig,  2001). 

c.  Some  aspects  of  flight  tend  to  be  more  danger¬ 
ous,  for  example  takeoffs,  landings,  night  flights, 
and  flights  in  or  near  severe  weather.  The  type  of 
flight  a  particular  pilot  typically  engages  may  differ 
from  another  pilot,  which  will  have  an  affect  on 
the  exposure  to  these  more  dangerous  maneuvers. 
For  example,  a  pilot  who  typically  makes  shorter 
flights  will  have  more  takeoffs  and  landings  per  unit 
time  period  than  a  pilot  who  typically  flies  longer 
cross-country  flights.  Therefore,  TFH  will  never  be 
a  perfect  proxy  for  risk. 

3.  Finally,  in  some  particular  data  (e.g.,  those  of  Figure  1), 

pilots  are  included  who  were  passengers  having  little 

or  nothing  to  do  with  the  cause  of  the  accident  itself^ 

Given  such  complexity,  we  might  despair  at  trying  to 
model  these  kinds  of  data.  Then  again,  as  George  Box 
observed  “All  models  are  wrong,  but  some  are  useful” 
(Box  &  Draper,  1987,  p.  424).  Perhaps  we  can  begin 
with  seeing  if  any  modeling  function  can  fit  these  data. 
If  so,  then  we  can  at  least  make  useful  predictions  about 
expected  accident  rates,  even  though  these  may  not  be 
purely  theoretic. 

In  the  present  work,  we  proceed  with  this  restricted  goal, 
using  a  standard  technique  of  minimizing  least-squares 
residuals  between  actual  data  and  a  simple  model  involving 
j  ust  two  component  functions.  The  first  component  will  be 
a  simple  log-transform.  The  second  will  involve  that  class 
of  particularly  useful  modeling  functions,  the  probability 
density  functions  (pdfs)  based  on  the  natural  logarithm  e. 
These  enclose  an  area  of  I.O  under  their  curves,  making 
them  useful  in  statistical  analysis  (Spanier  &  Oldham, 
1 987) .  That  class  includes  the  Gaussian  (normal) ,  Poisson, 
log-normal,  Weibull,  beta,  and  gamma  pdfs. 

^Data  source:  NTSB  downloadable  aviation  accident  database  www. 
ntsb.gov/avdata/.  Obviously,  some  readers  may  object  to  including 
pilots  who  were  not  directly  responsible  for  the  accident.  However, 
bear  in  mind  that  we  are  merely  describing  an  analytical  method  here, 
not  trying  to  support  specific  theoretical  statements  about  accident 
causation.  Moreover,  a  previous  study  conducted  by  the  author  indicate 
that  about  90%  of  GA  accidents  involve  single-pilot  flights,  where 
there  is  no  dispute  over  responsibility. 


To  illustrate  such  functions,  let  us  single  out  the  gamma 
pdf  (T^^.  This  is  a  2-parameter  function  with  a  shape  pa¬ 
rameter  a  >  0  and  a parameter  (3  >  0.  Gamma  pdfs  have 
been  used  to  model  a  wide  variety  of  processes,  including 
the  size  of  insurance  claims  (Hogg  &  Klugman,  1984), 
amounts  of  rainfall  (Ghiew,  Srikanthan,  Frost,  &  Payne, 
2005),  waiting  times  and  mean-time-to-failure  (where  it 
represents  time  until  the  ath  event  in  a  constant-hazard 
model),  and  distributions  of  microburst  wind  velocity 
(Mackey,  1998).  Since  the  GA  pilot  population  arguably 
consists  of  a  number  of  sub-populations,  some  having 
an  accident  rate  being  a  function  of  pilot  experience  and 
numbers  of  pilots — both  perhaps  inversely  related  to 
TFH — a  logical  function  to  test. 

The  basic  represented  as  a  function  ofx  (TFH), 
a  (alpha),  and  p  (beta) 


-xjp  a-\  n- 

T^,f{x-a,p)=  P 


r(«) 


(1) 


with  the  gamma  function  r(a)  itself  described  as  the 
Euler  integral  of  the  second  kind,  defined  for  a  >  0 

00 

(2) 

0 

Figure  2  shows  the  behavior  of  T^^ with  several  differ¬ 
ent  parameter  values.  We  can  easily  imagine  that  such  a 
function  could  be  fitted  to  our  kinds  of  data. 

Given  our  data,  a  number  of  practical  issues  arise: 

1 .  The  characteristically  long  tail  of  the  raw  data  results 
in  a  very  poor  fit  to  any  kind  of  pdf 

2.  A  location  (shift)  parameter  will  be  required,  since 
TFH  does  not  start  at  zero  for  instrument- rated  pilots. 

3.  An  amplitude  term  will  be  required  to  scale  the  unit 
pdf,  whose  area  under  the  curve  is  I.O,  to  the  binned 
data,  whose  area  under  the  curve  is  much  greater. 

4.  r^^should  rightfully  be  tested  alongside  other  plausible 
candidate  distributions. 

5 .  Once  we  arrive  at  a  preferred  fitting  function,  methods 
should  be  described  regarding: 

a.  Estimation  of  starting  values  for  constrained  pa¬ 
rameter  search 

b.  Goodness  of  fit  with  the  original  data 

c.  Galculation  of  selected  confidence  intervals 

d.  Quantizing  the  area  under  the  function 


Issue  I  suggests  a  very  simple  approach,  namely,  a 
compressive  transform  of  the  x-axis.  Indeed,  a  natural- 
log  compression  might  prove  suitable  for  “shrinking  the 
tail,”  to  make  the  raw  data  of  Figure  la  look  more  like 
something  found  in  Figure  2. 
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Issues  2  and  3  are  purely  practical.  A  location  parameter 
5  (delta)  will  overcome  the  problem  of  non-zero  initial 
data  values,  while  an  amplitude  term  (A)  will  scale  the 
area  under  the  pdf  to  our  data. 

These  modifications  of  Equation  1  lead  to  the  test¬ 
able  form: 


rJx;a,fi)  =  A- 


r{a) 


(3) 


Issue  4  requires  empirical  testing  of  a  reasonable 
set  of  candidate  pdfs  (Winkelmann,  2008).  Given  the 
nature  of  the  data,  we  can  immediately  rule  out  some 
candidates^  For  instance,  a  quick  glance  at  Figure  la 
shows  that  a  symmetrical  distribution  (e.g.,  Gaussian) 
obviously  cannot  fit  our  data.  Additionally,  we  probably 
want  a  function  capable  of  representing  overdispersion 
(variance  >  mean)  or  underdispersion  (variance  <  mean) . 
That  would  rule  out  the  Poisson,  where  the  variance  can 
only  equal  the  mean. 

Given  these  considerations,  gamma,  log-normal, 
Weibull,  and  beta  pdfs  remain  as  logical  candidates.  Because 
closed  solutions  do  not  exist  for  finding  optimal  parameter 
values  to  such  functions,  numerical  methods  must  be 
used.  These  present  their  own  challenges,  as  we  shall  see. 

For  this  study,  the  NonlinearFit  function  of  Math- 
ematica  7.0  (Wolfram,  2008)  was 
used  for  parameter  estimation.  For  un¬ 
constrained  parameters,  NonlinearFit 
offers  a  range  of  standard  numerical 
methods  (e.g.,  Newton-Gauss,  quasi- 
Newton,  Levenberg-Marquardt).  For 
constrained  parameters,  where  start¬ 
ing  and/or  final  parameter  values  at 
time  t  are  forced  to  lie  within  some 


range  p  .  <p  <p  ,  a  method  such  as  Karush-Kuhn-Tucker 

O  £  min  £  t  £  max 

(KKT)  is  preferable. 

An  alternative  approach  might  be  to  use  a  method  like 
simulated  annealing,  guaranteed  to  find  a  global  mini¬ 
mum  (Kirkpatrick,  Gelatt,  Vecchi,  1983;  Cerny,  1985). 
However,  such  methods  are  computationally  intensive 
and  slow,  providing  little  additional  benefit,  provided 
we  show  prudence  in  our  method. 

Finally,  Issue  5  suggests  finding  a  method  for  mapping 
the  area  under  the  fitting  curve  into  quantiles,  say  at  10% 
intervals.  For  the  time  being,  let  us  postpone  that  discus¬ 
sion  until  after  we  settle  on  a  single  fitting  function  and 
gain  some  experience  in  seeing  how  that  function  behaves. 

The  test  data 

Four  candidate  model  classes  were  tested:  beta,  gamma, 
log-normal,  and  Weibull  pdfs.  These  were  fitted  to  eight 
U.S.  GA  pilot  data  sets,  described  in  Table  1.  The  data  sets 
spanned  two  time  periods  (1983-2000  and  2001-May  15, 
2011),  two  categories  of  injury  (Serious  vs.  Fatal),^and  two 
categories  of  pilot  instrument  rating  (Instrument-rated  vs. 
Non-instrument-rated).  The  data  consisted  of  all  pilots 
involved  in  U.S.  GA  accidents  during  the  time  period  speci¬ 
fied,  regardless  of  whether  the  pilot  in  question  appeared  to 
be  legally  responsible  for  the  accident. 

Table  1.  Number  of  pilots  (a?)  in  the  /yWh  data  test  set 


Time  Period 

(/) 

1983-2000 

2001-2011 

Accident  Category 

(/■) 

SERIOUS 

FATAL 

SERIOUS 

FATAL 

Instrument-rated  pilots 

kk) 

ri'l'l^—  362 

rii2i—  831 

/i2jj=164 

n22i~^^^ 

Non-instrument-rated  pilots 

njj2=1051 

rii22~^  823 

n  21 

n222~^~^  1 

^During  review  of  this  paper,  one  reviewer  asked  about  the  negative 
binomial  function.  This  was  also  tested  but  eliminated  due  to  frequent 
optimization  failure. 


‘^NTSB  classifies  a  “serious”  or  “fatal”  accident  as  one  where  at  least 
one  person  onboard  at  least  one  airplane  was  seriously  or  fatally 


injured,  respectively. 
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The  raw  data  were  first  aggregated  into  histograms 
with  x-axis  bins  1 00  TFH  wide.  To  aid  visual  inspection, 
Appendix  A  shows  one  row  of  data  fits  using  Eq.  3  and 
a  second  row  using  Eq.  1 ,  where  the  log  transform  was 
performed  prior  to  the  data  fit.  The  latter  is  included 
because  that  particular  method  makes  it  much  easier  to 
visualize  the  underlying  functional  fit  differences. 

RESULTS 


Goodness  of  fit 

As  expected,  both  the  log-transform  of  TFH  and  the 
inclusion  of  the  location  parameter  8  proved  essential. 
Without  the  log-transform,  parameter  estimates  failed  to 
converge  for  nearly  all  datasets.  And,  without  8,  the  same 
held  true  for  instrument-rated  pilot  datasets. 

Even  with  the  log-transform  and  8,  however,  beta 
functions  rarely  produced  good  data  fits.  Therefore,  beta 
was  eliminated  from  further  consideration  as  a  modeling 
function,  and  for  the  sake  of  parsimony,  results  are  not 
shown  here.  The  three  remaining  fitting  functions  and 
eight  datasets  produced  24  models.  Appendix  A  shows 
these,  with  model  performance  and  parameter  estimates. 

Model  goodness-of-fit  can  be  expressed  by  a  variety  of 
metrics.  One  of  the  simplest  is  the  coefficient  of  determi¬ 
nation  7?^,  which  varies  between  0  and  1 ,  and  estimates 
the  proportion  of  explained  variance. 


R=l- 


total 


(4) 


Here,^  represents  the  predicted  value  of  j/  at  x,  versus 
the  observed  value  of  j/^.  Lower  R^s  merely  reflect  noisier 
data,  not  necessarily  fitting  failure. 

Two  aspects  of  the  data  affected  both  the  size  of  R^  and 
the  stability  of  model  parameter  estimates,  as  measured 
by  the  breadth  of  their  confidence  intervals.  As  expected, 
greater  random  variation  (noise)  in  the  data  and  smaller 
sample  size  both  led  to  lower  R^s.  Binning  the 

data,  of  course,  helps  dampen  noise,  but  at  the  expense 
of  lowering  the  effective  sample  size,  which  becomes  the 
number  of  bins  in,.)  rather  than  n  .  Figure  3  shows 
that  better  results  tended  to  occur  with  models  based 


on  >  300,  while  little  additional  advantage  resulted 

pilots  O 

from  n  >  1000. 

pilots 

Confidence  intervals  for  the  underlying  correlation 
r=Sqrt(7?^)  can  be  estimated  using  Fisher  s  method  (Glass 
&  Hopkins,  1984,  pp.  304-307). 


r  +  tanh(Z  +  )  =  r  ±  tanh 


0.51n 


A  +  |rp 

vl-Hy 


Vw  -3 


(5) 


where  Z  is  Fisher  s  Z-transform  of  r,  (sigma  sub-z)  is 
the  standard  error  of  Z,  and  is  the  normal  2:— value 
corresponding  to  the  desired  confidence  interval  (e.g., 
for  .95CI,  z^j=\3G). 

Based  on  7?^,  the  gamma,  log-normal,  and  Weibull 
pdfs  all  appeared  reasonable  candidates  for  use  with  this 
broad  range  of  accident  categories.  Appendix  A  shows  R^s 
ranging  from  .862-. 994,  considered  good-to-excellent. 

While  it  was  possible  to  get  Poisson  models  to  converge, 
the  R^s  were  uniformly  and  significantly  lower  than,  for 
instance,  those  of  Gamma  supporting 

exclusion  of  the  Poisson  from  further  consideration.  Table 
2  compares  the  two. 

The  three  remaining  model  classes  can  be  statisti¬ 
cally  compared  by  setting  up  R^s  as  if  each  of  the  eight 
datasets  were  an  “individual,”  and  the  three  models  were 
repeated  measures  experienced  by  each  “individual.”  The 
three  “Raw  data”  rows  in  Table  3  show  the  setup  of  this 
comparison. 


Table  2.  Comparing  Poisson  and  Gamma  R^s 


Dataset 

1 

2 

3 

4 

5 

6 

7 

8 

Mean 

Median 

Gamma 

.994 

.940 

.994 

.932 

.983 

.946 

.967 

.864 

.953 

.957 

Poisson 

.890 

.770 

.880 

.748 

.929 

.873 

.892 

.840 

.853 

.877 
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Table  3.  Comparing  Gamma,  Log-normal,  and  Weibull  R^s 


Dataset 

1 

2 

3 

4 

5 

6 

7 

8 

Mean 

Median 

Raw  data 

Gamma 

0.994 

0.940 

0.994 

0.932 

0.983 

0.946 

0.967 

0.864 

0.953 

0.957 

Log-normal 

0.994 

0.942 

0.994 

0.931 

0.982 

0.947 

0.964 

0.862 

0.952 

0.956 

Weibull 

0.993 

0.923 

0.994 

0.931 

0.982 

0.948 

0.968 

0.872 

0.951 

0.958 

Z-transformed 

Gamma 

2.903 

1.738 

2.903 

1.673 

2.380 

1.792 

2.044 

1.309 

2.093 

1.918 

Log-normal 

2.903 

1.756 

2.903 

1.666 

2.351 

1.802 

2.000 

1.301 

2.085 

1.901 

Weibull 

2.826 

1.609 

2.903 

1.666 

2.351 

1.812 

2.060 

1.341 

2.071 

1.936 

Figure  4.  a)  Frequency  histograms  for  1000  Monte  Carlo-simulated  R^s 
randomly  generated  by  one  of  our  models.  The  y-axis  shows  the  frequency 
count  of  R^s  in  each  x-axis  bin.  Note  the  negative  skew;  b)  Z-transform  corrects 
this  negative  skew. 


Quick  inspection  of  these  raw  R^s  shows  no  prominent 
differences  between  modeling  functions.  Analysis  of 
variance  (AN OVA)  can  confirm  that.  But,  first  we  have 
to  consider  that  our  theoretical  distributions  of  of  are 
not  expected  to  be  normal,  since  R^  is  range-restricted  to 
0<7?^<1.0.  AN OVA  is  famously  tolerant  of  some  devia¬ 
tion  from  normality,  but  these  R^s  are  high,  and  might 
be  a  problem.  Indeed,  Monte  Carlo  simulation  reveals 
considerable  negative  skewness,  as  Figure  4a  illustrates. 

We  can  correct  that  skewness  using  Fishers  Z- 
transform,  a  variant  of  Equation  5: 


^corrected 


0.5h 


1  + 

1- 

(6) 


Figure  4b  illustrates  the  resulting  improvement  in 
normality.  Applying  that  method  to  all  our  R^s  produces 
the  bottom,  “Z- transformed”  half  of  Table  3,  which  we 
can  now  more  legitimately  analyze. 

Subsequent  AN OVA  reveals  no  significant  differences 
between  the  gamma,  log-normal,  and  Weibull  R^s  (j?  =.429, 
NS,  Greenhouse-Geisser-corrected  for  non-sphericity). 


One  salient  feature  distinguishing  the  three  model 
classes  was  the  left-hand  side  of  the  curve.  As  Appendix 
A  makes  plain  in  In (x) -space,  Weibull  pdfs  tended  to  be 
“fatter”  on  the  left,  whereas  gamma  and  log-normal  pdfs 
tended  to  be  more  symmetrical  and  to  resemble  each 
other  more  closely.  The  exact  nature  of  the  leftmost, 
lowest-TFH  end  of  these  distributions  is  something  that 
can  be  investigated  more  closely  in  future  years,  as  data 
accumulate,  allowing  more  reliable  statistical  analysis. 

Final  choice  of  a  fitting  function 

Judging  solely  by  R^  and  it  would  be  imprudent 
to  recommend  one  model  over  another  on  the  grounds  of 
theory.  Arguably,  though,  the  gamma  pdf  T^^^is  most  useful 
for  two  reasons.  To  a  lesser  extent,  experience  with  these 
data  showed  that  T  was  the  easiest  function  to  fit.  More 
compellingly,  T^^^  allows  calculation  of  confidence  bands 
around  the  modeling  function  itself  These  confidence  bands 
provide  estimates  of  the  net  stability  of  predictions  based 
on  each  dataset.  Appendix  A  shows  how  these  confidence 
bands  are  influenced  by  dataset  size  n..^,  as  intimated  earlier 
by  Figure  3.  Smaller  datasets  are  considerably  less  reliable. 

For  these  pragmatic  reasons,  we  choose  to  examine 
r^^in  detail  for  the  rest  of  this  report. 
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Estimating  parameter  start  values  for 

The  raw  data  show  considerable  noise.  This  produces 
lumpy  parameter  residual  error  spaces  and  the  risk  of 
optimization  failure  due  to  local  minima.  Numerical 
methods  for  T^^^  parameter  estimates  therefore  benefit 
from  having  start  values  as  close  as  possible  to  final  values. 

Estimates  for  (X  and  (3  can  be  derived  by  the  method 
of  moments  (see  Appendix  B  for  details). 


(z:>J 


(V) 


Pest 


(8) 


A  start  value  for  the  amplitude  term  (A)  can  be  esti¬ 
mated  as 

A  _  y max,  _ y max _ 

“  (9) 

r(a) 

where,  of  the  binned  data,  y  is  the  maximum  function 
height,  and  is  the  height  of  the  model  r^^^ode^(Af )  of 


M^={a-\)p 


(10) 


Parameter  confidence  intervals  for  T  ,, 

pdf 

Parameter  confidence  intervals  can  be  estimated  with 
methods  based  on  the  Student  t-distribution 

PiCI  ~  Pi  —  ^^An-p,\-a/2)  (12) 

where  the  ith  parameter p.  is  augmented  or  diminished  by 
its  standard  error  (5£.)  times  the  value  of  corresponding  to 
n-p  degrees  of  freedom  and  the  desired  2-tailed  significance 
level  a  (the  acceptable  Type- 1 ,  or  false-positive  statistical 
error  rate,  not  to  be  confused  with  our  a  parameter  in 
With  binned  data,  n  will  be  the  number  of  bins  in 
each  dataset,  and  p=4,  the  number  of  parameters  in  T^^ 

(i.e..  A,  a,  (3,  8). 

Estimates  of  the  standard  error  SE.  are  beyond  the 
scope  of  this  report.  The  reader  is  referred  to  Ratkowsky 
(1989,pp.  36-42)  for  a  treatment  of  parameter  confidence 
intervals.  However,  a  number  of  common  statistical 
and  mathematical  software  packages  (e.g.,  SPSS,  SAS, 
MATLAB,  Mathematicd)  will  numerically  estimate  the 
appropriate  standard  errors. 

Using  r 

For  a  given  dataset,  to  estimate  the  expected  accident 
count  for  a  bin  width  of  100  TEH  centered  on  a  given 
value  of  x,  simply  populate  Equation  3  s  parameters 
from  Appendix  A  and  insert  the  desired  value  of  x. 
For  example,  for  non-instrument-rated  pilots  having 
fatal  accidents  from  1983-2000,  inclusive,  the  da¬ 
taset  “1983-2000  NIR  FATE,”  Equation  3  becomes 


Finally,  we  can  estimate  the  location  parameter  (8)  by 
forcing  the  peaks  of  both  the  data  and  T^^  to  take  the 
same  x-value.  This  leads  to 

^est  ^max,  binned  data  ^Mo  ( H ) 

In  practice,  the  parameter  spaces  are  lumpy  enough  so 
that  occasional  fit-failure  can  result,  even  with  reasonable 
starting  estimates  (with  these  eight  datasets,  this  happened 
once).  If  all  else  fails,  this  can  be  resolved  by  graphing  out 
the  fitting  function  for  the  log- transformed  data,  starting 
with  the  estimated  parameters,  then  hand-manipulating 
them  to  better  values  by  visual  inspection.  Figure  5  il¬ 
lustrates  this,  using  Mathematical  Manipulate  function, 
which  allows  function  parameters  to  be  adjusted  with 
sliders,  and  immediately  graphs  the  result. 


776.3 

whose  net  frequency  count  at  median  TEH  of  x’=340  is 
^pd/^p6f^  rp^^(340y)— 1 93,  which  we  can  see  is  correct  from 
its  plot  (Figure  6). 

Quantizing 

Appendix  A  shows  the  actual  median  (0.5  quantile) 
of  It  is  also  useful  to  have  a  method  for  dividing 
fitted  data  into  arbitrary  quantiles  that  can  be  precisely 
calculated,  for  instance,  into  groups  containing  equal 
percentages  of  area  under  the  fitting  curve  T  ^  Equation 
14  shows  a  method  for  non-log- transformecf  data,  which 
is  based  on  the  definite  integral  of  r^^(Eq.  3)  from  x=e^ 
to  the  x-value  corresponding  to  a  desired  quantile 
(e.g.,  0.8),  the  corresponding  area  under  the  curve, 
represented  as 

^qn 

^n=  \  ^pdf  (x,  5)dx  ( 1 4) 


^Assuming  OC  >1,  which  all  Ots  here  are. 
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Figure  5.  (top)  The  single  dataset  where  estimated  starting  parameters  (shown)  led  to  fit-failure;  bottom) 
Slight  manual  adjustment  of  those  initial  estimates  led  to  subsequently  successful  fit. 


Figure  6.  Model  for  dataset  “1983-2000  NIR  FAIL.”  The  dashed  line  denotes  the  median 
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There  is  no  closed-form  solution  for  arbitrary  x  . 
Fortunately,  an  iterative  numerical  method  is  easily 
stated.  We  note  that  the  minimum  value  of  T  ,^  =  0 

pdf 

in  the  linear  domain  is  specified  by  x=e^,  and  that  we 
normalize  the  area  under  given  that  it  computa¬ 
tionally  represents  n  pilots  put  into  bins  100  FH  wide. 


^  r  /  ^ 

/  J 

-^n 

In  other  words,  start  with  an  arbitrary  value  for  x, 
integrate  (Eq.  3)  to  find  the  area  under  the  curve 
from  to  x,  next  calculate  the  error  8  between  that 
area  and  our  desired  quantile,  and  then  adjust  x  in  the 
direction  that  minimizes  8,  halting  when  8  falls  below  a 
critical  tolerance  value  8^  (epsilon  sub-c).  This  is  easily 
done  with  software  such  as  Mathematical  given  a  simple 
statement  such  as 

FindArgMin[Abs[(NIntegrate[G^^^x],{x,E^,mu}]/ 
( 1 00*n))-qj ,  {mu, Quantile  [data, qj  }  ]  [  [  1  ]  ]  ;^ 

The  data  of  Figure  6  produce  the  plot  for  8: 


low  and  very  high  values  of  x  (here  being  TFH).  Often, 
we  find  that  conservative  interpretation  works  best.  In 
other  words,  we  consciously  choose  to  limit  high  “logical 
confidence”  in  predicted  accident  frequency  to  a  middle 
range  of,  say,  50-5,000  TFH. 

The  roots  of  this  conservativism  are  grounded  in  model 
construction,  sampling,  and  residual  error  spaces.  Residual 
error  is,  of  course,  the  average  squared  difference  between 
the  j/s  our  model  predicts,  given  the  x- values  of  the  data. 
Total  residual  error  is  the  quantity  we  wish  to  minimize 
by  adjusting  our  model  parameters.  Now,  what  we  some¬ 
times  see  is  that  this  residual  error  can  be  minimized  pretty 
well  by  a  range  of  models,  all  of  which  provide  a  pretty 
good  fit  to  most  of  the  data.  This  happens  when  we  have 
a  complex  residual  error  landscape  with  many  local  hills 
and  valleys  (as  opposed  to  a  simple,  monotonic  landscape 
with  just  one  deep,  global  minimum). 

The  exact  geometry  of  the  error  space  is,  of  course, 
largely  dictated  by  the  data.  But,  it  is  also  a  function  of 
the  model  and  of  how  the  data  are  set  up.  For  instance, 
suppose  we  have  two  accidents,  one  at  15,000  TFH, 
the  other  at  15,050  TFH.  In  that  event,  the  error  space 
very  much  depends  how  wide  our  sampling  bins  are  and 


Figure  7.  Plot  of  Eq.  15  for  “1983-2000  NIR  FAIL,”  here  minimizing  at  the 
median  =  x  ... 

qn  pdf 


Naturally,  the  minimization  algorithm  will  oscillate 
around  the  discontinuity,  but  this  poses  no  practical 
problem  to  finding  an  accurate  solution  for  the  median. 

A  conservative  appraisal  of  model  accuracy  at  ex¬ 
treme  X- values 

Extensive  experience  with  “extreme”  datasets,  such  as 
the  ones  encountered  here,  teaches  us  to  be  wary  of  what 
otherwise  may  seem  like  high-precision  modeling  at  very 

^Here,  Quantile  [data,  qn]  is  just  a  specification  to  the  function 
FindArgMin,  telling  it  to  find  the  minimum  o[\(rpdf(x)/100n)-qn\, 
starting  at  x-value  mu,  specified  as  the  desired  quantile  of  the  raw 
data,  which  Mathematica  conveniently  has  a  built-in  function  to  find. 


where  each  bin  starts  and  finishes  on  x.  In  the  very  long 
right-hand  tail  of  data  like  ours,  the  typical  actual  ac¬ 
cident  frequency  bin  count  is  going  to  be  0.  But,  if  our 
bins  are  wide  along  x,  or  spaced  in  a  certain  way  on  the 
x-axis,  instead  of  getting  two  bins,  each  I  unit  high,  we 
can  end  up  with  one  bin  2  units  high.  Amidst  a  sea  of 
bins  0  units  high,  the  residual  error  of  this  2-accident 
bin  will  be  almost  (2-0)^  =  4,  as  opposed  to  the  2(1-0)^ 
=  2  units  it  would  otherwise  be. 

The  point  is  that  the  right-hand  tail  of  distributions 
like  these  can  become  “logically  unreliable.”  Most  bins 
in  the  long  right-hand  tail  will  contain  zero  cases.  Con¬ 
sequently,  the  farther  out  on  the  x-axis  a  non-zero  bin  is. 
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the  greater  its  effect  on  model  parameters.  That  makes 
the  far  end  of  a  long-tailed  distribution  less  trustworthy 
than  we  would  like  it  to  be. 

The  left-hand  end  of  this  kind  of  distribution  is  also 
somewhat  troubled,  as  we  can  see  from  the  confidence 
intervals  in  Appendix  A.  Many  of  these  confidence  in¬ 
tervals  tend  to  be  wide  on  the  left,  which  also  happens 
to  be  a  region  of  relatively  few  accidents.  The  very  lowest 
TFH  bin  is  often  not  far  removed  from  the  very  tallest, 
in  which  case  a  small  5  shift  in  the  curve  to  the  right 
or  left,  can  accompany  a  large  change  in  the  amplitude 
parameter  A,  and/or  a  large  shift  in  a  and/or  (3.  All  this 
goes  to  show  that  highly  sloped  frequency  distributions 
can  sometimes  be  represented  by  a  multiplicity  of  pretty 
good  models  which,  nonetheless,  may  vary  widely  in  pa¬ 
rameter  estimates,  which  differ  most  in  their  predictions 
at  extreme  values  of  x. 

Therefore,  as  stated  previously,  we  are  prudent  to  put 
our  greatest  faith  in  the  middle  TFH  ranges  for  these 
models,  perhaps  in  the  50-5, 000 TFH  range.  Fortunately, 
that  is  the  range  most  interesting  to  most  of  us  under 
most  circumstances,  because  it  captures  the  vast  majority 
of  accidents. 

DISCUSSION 

Figure  8a  shows  a  frequency  histogram  commonly 
seen  in  general  aviation  (GA)  accident  analysis,  namely 
accident  count  as  a  function  of  pilots’  total  flight  hours 
(TFH).  A  modeling  function  would  be  useful  to  smooth 
the  noise  in  such  data,  allowing  investigators  to  better 
predict  how  likely  pilots  of  a  given  experience  level  are  to 
be  involved  in  accidents.  This  would  be  useful,  for  instance, 
in  allocating  resources  for  pilot  training  or  mentoring, 
and  as  the  basis  for  a  statistical  covariate  of  flight  risk. 

In  this  report  we  tested  a  number  of  candidate  mod¬ 
eling  functions  on  eight  samples  of  NTSB  GA  data 
encompassing  the  years  1983-2011.  Appendix  A  shows 


that  the  gamma,  log-normal,  and  Weibull  probability 
density  functions  were  all  able  to  fit  such  data,  given  x-axis 
data  bins  100  TFH  wide.  Estimates  of  goodness-of-fit 
{R^)  ranged  from  .8 6-. 9 9  (good-to-excellent)  and  did 
not  differ  significantly  across  those  three  model  classes. 

Log- transformation  of  TFH  proved  critical  to  the 
success  of  these  data-fits.  Untransformed  TFH  (e.g..  Fig. 
7a)  frequently  led  to  catastrophic  fit-failure. 

The  raw  data  exhibited  an  extremely  long  right-hand 
tail,  due  in  part  to  pilot  aging  and  dropout,  but  also  to 
the  confound  of  relatively  few  pilots  having  large  num¬ 
bers  of  commercial  FH  rolled  into  their  GATFH.  The 
log-transform  (Fig.  8b)  effectively  compressed  this  “long 
tail,”  allowing  successful  data-fit. 

Although  log-normal  and  Weibull  functions  appeared 
capable  of  fitting  these  data,  there  is  some  pragmatic 
advantage  to  using  a  gamma  pdf  (Equation  3).  A  gamma 
pdf  allows  estimation  of  confidence  intervals  around  the 
fitting  function  itself  (.95CI,  Fig.  8b).  The  width  of  these 
confidence  intervals,  of  course,  is  both  a  function  of  the 
sample  size  and  inherent  noise  in  the  data. 

Due  to  the  nature  of  the  data,  it  may  be  advisable  to 
place  the  greatest  prediction  confidence  in  a  middle  range 
ofTFH,  perhaps  from  50-5,000.  Fortunately,  that  is  also 
the  range  that  captures  the  vast  majority  of  all  GA  pilots. 

Because  more  than  one  function  class  seems  capable 
of  fitting  these  data,  no  simple  theoretical  claims  can 
be  made  about  causation.  Gausation  certainly  involves 
multiple  processes,  some  perhaps  embodying  sums  of 
independent  exponential  decay  processes  (gamma), 
“failure  rate”  processes  (Weibull),  and  multiplicative 
random  processes  (log-normal) .  Theorizing  is  hampered 
by  a  host  of  factors,  such  as  the  confounding  of  relatively 
safer  commercially  logged  flight  hours  counted  as  TFH, 
the  dropout  of  non-instrument-rated  pilots  to  become 
instrument-rated,  and  that  some  phases  of  flight  are  more 
dangerous  than  others,  meaning  that  flight  risk  is  not 
merely  linearly  proportional  to  time  spent  aloft. 


Figure  8.  Frequency  histograms  of  GA  fatal  accident  count  (y-axis)  with  a)  untransformed,  and  b)  natural  log  (In)- 
transformed  x-axis  (which  eventually  proved  essential  to  successful  data-fitting). 
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Therefore,  the  goal  of  the  current  effort  is  largely 
atheoretical,  being  merely  to  show  that  such  data  can  be 
modeled.  With  some  care,  GA  accident  frequencies  can  be 
predicted  fromTFH,  given  data  parsed  by  a)  pilot  instru¬ 
ment  rating  and  b)  seriousness  of  accident.  Goodness-of-fit 
{R^)  tended  to  be  excellent  for  non-instrument-rated  pilot 
data  and  good  for  instrument-rated  data.  Estimates  of 
median  TFH  were  derived  for  each  dataset,  which  will 
be  useful  to  aviation  policy  makers. 
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APPENDIX  A 


Comparing  data  fits  for  the  eight  datasets  of  Table  1.  The  x-axis  represents  TFH,  the  y-axis  is  aeeident 
frequeney  eount.  Dashed  vertieal  lines  show  Ihe  median  value  of  TFH  dividing  the  area  under  the  modeling 
eurve  area  in  half  A  .95CI  braekets  eaeh  gamma  pdf  The  first  row  of  models  has  a  linear  x-axis,  the  second  row 
has  a  ln(x)-axis,  which  allows  easier  inspection  of  data  fits  between  model  classes. 


Gamma 


Model  Class 

Log-Normal  Weibull 


1983-2000  non-instrument-rated  fatal 


Parameter  .95Cliower 

A  776.3  765.8 

a  7.789  5.996 

P  0.286  0.253 

6  2.821  2.559 

r  0.9968  0.9965 


■95C1„, 

786.8 

9.572 

0.319 

3.085 

0.9971 


Parameter  .QSCIiow^.  .gSCI-p^  |  Parameter  .QSCIiower  .QSCl.i™  I 


A  777.1 
[i  1.230 
a  0.228 
5  1.533 

r  0.9969 


766.9 
1.118 
0.204 
1.151 
0.9966 


787.4 
1.343 
0.250 
1.915 
0.9972 


A  782.0 
a  1.721 
P  1.431 
5  3.786 

r  0.9964 


767.2 
1.646 
1.376 
3.746 
0.9960 


796.9 
1.795 
1.486 
3.826 
0.9967 


1983-2000  instrument-rated  fatal 


I  Parameter  .QSCIiower  .9501.—  I  Parameter  .95Cliower  .95Clupper  Parameter  .95Cliower  .95Clupper 


A 

92.30 

88.87 

95.73 

A 

91.57 

88.35 

94.79 

A 

95.33 

89.58 

101.1 

a 

4.902 

3.189 

6.616 

0.500 

0.360 

0.640 

a 

2.217 

1.819 

2.614 

P 

0.311 

0.252 

0.369 

a 

0.389 

0.336 

0.442 

P 

1.636 

1.311 

1.962 

6 

5.011 

4.730 

5.291 

5 

4.787 

4.552 

5.022 

6 

5.011 

4.656 

5.365 

r 

0.9694 

0.9650 

0.9732 

r 

0.9706 

0.9664 

0.9743 

r 

0.9607 

0.9551 

0.9656 

A1 


Gamma 


Weibull 


Log-Normal 


1983-2000  non-instmment-rated  serious 


I  Parameter  .QSCIiower  I  Parameter  .95Cliower  .95Clupper  |  Parameter  ■95Cliower  ■95CI...™  I 

A  472.4  465.5  479.2  A  472.4  465.6  479.2  A  475.8  466.6  485.0 


a 

8.322 

6.163 

10.48 

p 

1.266 

1.133 

1.398 

a 

1.746 

1.667 

1.825 

p 

0.271 

0.236 

0.306 

a 

0.216 

0.189 

0.243 

p 

1.418 

1.361 

1.475 

6 

2.776 

2.476 

3.077 

6 

1.396 

0.930 

1.861 

6 

3.780 

3.738 

3.822 

r" 

0.9968 

0.9964 

0.9972 

r^ 

0.9968 

0.9964 

0.9972 

0.9968 

0.9964 

0.9972 

^The  equality  of  values  for  r=0.9968  across  distributions  here  is  simply  a  coincidence. 


1983-2000  instrument-rated  serious 


npiiots“362 

R^=0.932 

Xtfh-1067 

npiiots“362 

R^=0.931 

Xtfh-1  063 

npiiots“362 

R^=0.931 

Xtfh=1064 

1  Parameter 

■95CI|„wer 

1 

1  Parameter 

.95CI|o»er 

.95CI,«»,  1 

1  Parameter 

.95CI,ower 

.95CI{ipp^f 

A  42.12 

39.85 

44.38 

A 

42.36 

40.07 

44.65 

A 

42.35 

39.61 

45.09 

a  10.33 

2.150 

18.51 

p 

1.226 

0.765 

1.687 

a 

2.397 

1.878 

2.916 

P  0.208 

0.126 

0.289 

a 

0.193 

0.109 

0.277 

P 

1.644 

1.260 

2.028 

6  4.357 

3.449 

5.266 

5 

3.023 

1.434 

4.611 

6 

5.011 

4.600 

5.421 

r  0.9653 

0.9575 

0.9717 

r 

0.9651 

0.9572 

0.9715 

r 

0.9650 

0.9571 

0.9715 

A2 


Gamma 


Log-Normal 


Weibull 


2000-2011  non-instmment-rated  fatal 


A  219.7 
a  17.94 
P  0.216 
5  1.203 

r  0.9914 


213.5 

5.006 

0.141 

0 

0.9899 


226.0 

30.87 

0.290 

2.695 

0.9927 


A  219.7 

[i  1.611 

a  0.179 
6  0 
r  0.9912 


213.5 

1.305 

0.128 

-1.534 

0.9896 


225.9 

1.916 

0.230 

1.534 

0.9925 


A  231.2 
a  2.083 
P  1.921 
5  3.422 

r  0.9912 


207.5 

1.855 

1.704 

3.207 

0.9896 


218.9 

2.309 

2.138 

3.636 

0.9925 


2000-2011  instrument-rated  fatal 


Parameter 

A  54.03 
a  14.27 
P  0.218 
5  3.288 

r  0.9726 


.9501, ower 
51.70 
4.592 
0.146 
2.148 
0.9672 


■95Clyppe|. 

56.37 

23.96 

0.289 

4.429 

0.9771 


Parameter  .95Clic 


55.87 

1.840 

0.132 

0 

0.9732 


53.41 

1.351 

0.071 

-3.104 

0.9679 


■95CI.„  I 


58.34 

2.330 

0.193 

3.105 

0.9776 


Parameter 

A  55.15 
a  2.380 
P  2.027 
5  4.568 

r  0.9736 


.95CI,ower 

52.87 

2.099 

1.775 

4.303 

0.9684 


■95CI„pp^, 

57.43 

2.660 

2.279 

4.833 

0.9779 


A3 


Gamma 


Weibull 


Log-Normal 


2000-2011  non-instrument-rated  serious 


50- 

- 1 

f 

40 

/ 

/ 

\ 

7^ 

\ 

/ 

\ 

/ 

_ 

y 

npiiots“328  R^-0.967  xtfh-455  npiiots-328  R^-0.964  xtfh-456  npiiots-328  R^-0.968  xtfh-455 

I  Parameter  \  \  Parameter  .9501. —  .QSCl.n^^  1 1  Parameter  .QSCIiower  .9501—  I 


A 

113.9 

109.2 

118.7 

A  113.9 

109.1 

118.7 

A 

112.2 

107.6 

116.8 

a 

32.82 

-4.546 

70.19 

1.637 

1.258 

2.016 

a 

2.635 

2.188 

3.081 

P 

0.158 

0.071 

0.246 

a 

0.172 

0.110 

0.234 

P 

2.396 

1.972 

2.821 

5 

0 

-3.082 

3.082 

6 

0 

-1.947 

1.947 

5 

3.066 

2.635 

3.497 

r 

0.9831 

0.9791 

0.9864 

r 

0.9817 

0.9773 

0.9853 

r 

0.9841 

0.9803 

0.9872 

2000-2011  instrument-rated  serious 


Parameter  .95Cliower  .95Clupp«p  Parameter  .95Cliower  .95Clupppr  |  Parameter  ■95Cliower  95CI^  I 


A 

24.78 

22.97 

26.59 

A 

26.11 

23.58 

28.64 

A 

24.00 

22.67 

25.33 

a 

5.383 

1.285 

9.482 

|i 

1.786 

0.841 

2.731 

a 

1.547 

1.390 

1.703 

P 

0.376 

0.233 

0.518 

a 

0.152 

0.021 

0.283 

P 

1.396 

1.282 

1.509 

5 

4.115 

3.248 

4.982 

6 

0 

-5.711 

5.711 

5 

4.936 

4.851 

5.022 

r 

0.9295 

0.9051 

0.9477 

r 

0.9283 

0.9035 

0.9468 

r 

0.9336 

0.9106 

0.9508 

A4 


APPENDIX  B 


Estimates  for  the  gamma  pdfs  parameters  a  and  (3  can  be  calculated  using  the  method  of 
moments.  The  expected  value  (mean,  or  first  population  moment)  of  the  gamma  pdf  is 


E{x)  =  ap 

(16) 

while  the  second  population  moment  is 

(17) 

From  our  data,  we  can  estimate  the  first  moment  as 

y "  X. 

n 

(18) 

and  the  second  moment  as 

2 

— 

n 

(19) 

where  the  binned  data  have  been  shifted  to  start  at  zero  by  subtracting  the  x- 

-value  of  the  first  bin  from  all 

others. 

Letting 

II 

3 

(20) 

and 

a{a  + 1)/?^  = 

(21) 

we  can  solve  for  a  and  6  as 

ml 

a=  , 

(22) 

(23) 

m. 


Now,  using  these  estimates  for  a  and  6,  plus  the  expected  x-value  for  the  mode  {xmo)  of  the  pdf, 

Mo  =  {a-\)l3  (24) 


plus  ymax,  the  maximum  observed  j-value  in  the  data,  the  amplitude  term  (A)  can  be  estimated  as 

K  y  max  _  y  max 


^est 


y  mode 


~^Mol  P  ^  ^  ^ 


X 


Mo 


r(o) 


T(a) 


(25) 


B1 


APPENDIX  C 


(^^This  is  the  Matf>ema£rca  code  used  to  generate  the  data  for  the  paper 
SetDirectorv["M:\^ew  Projects\^TSB  Rec  Popup\JU1odeling  accidents  x  FH"];  FileNames[t 
filename  =  "Data  2000-2011 IR  SERS.txt";  "Data  1983-2000 IR  FATL.txt"  is  the  only  file  that  wouldn't 
converge  with  pre -estimated  parameters 
data  =  actualYs  =  predictedYs  =  {  };  maxLogFFI  =  0; 
myFile  =  OpenRead[filenamet 
onePoint  =  Read[m(yFile,  Number] ; 

While[ToString[onePoint]  t  "EndOfFile",  AppendTo[data,  onePointt  onePoint  =  Read[ni(yFile,  Number]  I 
Close[m(yFilet 
n  =  Length[datat 

FH  =  Median[datat  (^median  flight  hours  4^] 
binSize  =  100; 

base  =  Floor[data[[1]l  binSizet 

binnedData  =  binnedLnData  =  BinCounts[data,  binSize];  (^^Tally  data  into  bins  of  the  specified  size^^] 
nBins  =  Length[binnedLnDatat  modeLnFH  =  0; 

For[i  =  1,  i  <  nBins,  i++, 
tempi  =  binnedData[[i]t 
temp2  =  binnedLnData[[i]t 

Add  X- values  for  each  bin,  whose  midpoint  is  halfway  into  the  bin 
binnedData[[i]]  =  {N[base+  0  -0.5]  binSize  I  tempi }  ; 

binnedLnData[[i]]  =  {N[Log[base  +  (|  -  0.5]  binSize  ]l  temp2  }  ;  Add  Log -transformed  x-bin  values,  similarly 
inbinnedLnData[[i,  2]]  >  modeLnFH,  modeLnFH  =  binnedLnData[[i,  2]]] 
t 

minFH  =  binnedData[[1, 1]t  maxFH  =  binnedData[[nBins,  1]t 
minLogFH  =  binnedLnData[[1,  ^]l  maxLogFH  =  binnedLnData[[nBins,  ^]l 

Clear[A,  of,  ft,  ^  <r,  tf,  mu,  pPDF,  constraints,  initialValuest 
gammaModel  =  A  4<PDF[GammaDistribution[ff,  x  -tft 
LogNormalModel  =  Re[A  x  -  tf]t 

WeibullModel  =  Re[A  4<PDF[WeibullDistribution[ff,  x  -  6]l  of  is  shape  parm  and  fi  is  scale  parm  4^] 

selectedModel  =  gammaModel; 

IflselectedModel  ==  gammaModel,  modelName  =  "GammaModel";  constraints  =  {0  <  A  <  2000,  ^  >  0,  ^  >  0,  minLogFH  >  >  0}; 
logDataZeroed  =  N[Log[data]t  Here,  in  the  ln[x]  domain,  we  begin  estimation  of  starting  values  for  parameters 
logDataZeroed  =  logDataZeroed  -logDataZeroed[[1]t 
ml  =  Mean[logDataZeroedt  1st  moment  ■+) 
m2  =  Mean[logDataZeroed^2t  i*2nti  moment 
aa  =  ml  ^2  /  (m2  -  ml  ^2];  (:4£stimate  of  of,  based  on  data^^] 
bb  =  (m2  -  ml  ^2] /ml;  (^Estimate  of  based  on  data:4<] 
modeEstX  =  (aa  - 1]  bb;  (+  Est.  of  the  x- value  of  the  mode  +] 

modeEstY  =  PDF[GammaDistribution[aa,  bbl  modeEstX  t  (^^Est.  of  the  y- value  of  gamma  pdf  at  the  mode  4^] 

AA  =  modeLnFH  /  modeEstY;  Est.  of  amplitude  factor  needed,  based  on  max  val  of  data  /  y- value  at  the  esti  x- val  of  mode^^] 
dataMax  ={1,-99}; 

For[i  =  1,  i  <  Length[binnedLnDatal  i++,  inbinnedLnData[[i,  2]]  >  dataMax[[2]l  dataMax  =  binnedLnData[[i]]]t 

dataXMax  =  dataMax[[1]t  dataYMax  =  dataMax[[2]t 

temp  =  FindMaximum[AA  4<PDF[GammaDistribution[aa,  bbl  xl  xt 

pdfYMax  =  temp[[1  ]t  pdfXMax  =  temp[[2, 1, 2]t 

dd  =  dataXMax  -  pdfXMax;  {■+.  pdfXMax  will  also  =  (aa-1]  bb  4^] 

Print["m1: ",  ml, "  m2:",  m2,  "  aa:",  aa, "  bb:",  bb,  "  modeExtX: ",  modeEstX, "  modeEstY:", 
modeEstY, "  AA: ",  AA, "  dd: ",  dd  I 
initialValues  ={{A,  AA},{ff,  aa},  {fi,  bb},{tf,  dd}}; 

Iflfilename  ==  "Data  1983-2000  IRFATL.txt",  initialValues  =  {{A,  162},  {ff,  6.5},  {f,  .37},  {tf,  4}}l 
t 


Cl 


IflselectedModel  ==LogNormalModel,  modelName  =  "LogNormal  model";  constraints  ={0  <  A  <  5000,  o-  >  0,  minLogFH  >  >  0}; 
Iflfilename  ==  "Data  1983-2000  NIRFATL.txt",  initialValues  =  {{A,  777},  1.23},  {ff,  0.228},  {tf,  1.53}}l 

Iflfilename  ==  "Data  1983-2000  IRFATL.txt",  initialValues  ={{A,  92}, 0.5},  {a,  .39}, {tf,  4.8}} I 
Iflfilename  zz"Data  1983-2000  NIR  SERS.t xt",  initialValues  ={{A,  472},  {fi^  1.27},  {ff,  .22},  {tf,  1.4 }}l 
Iflfilename  ==  "Data  1983-2000  IRSERS.txt",  initialValues  ={{A,  42}, {fi^  1.23},  {a,  0.19}, {tf,  3}}]; 

Iflfilename  zz  "Data  2000-2011  NIR  FATL.txt",  initialValues  =  {{A,  220},  {fi^  1.6},  { ff,  .18},  {tf,  0.001  }}l 
Iflfilename  z=  "Data  2000-2011 IR  FATL.txt",  initialValues  =  {{A,  56},  {fi^  1.86},  { ^,  .13},  {tf,  0.001  }}t 
Iflfilename  zz  "Data  2000-2011  NIR  SERS.txt",  initialValues  =  {{ A,  1 14},  {fi^  1.6},  { ff,  .17},  {tf,  0.01 }}  ]; 

Iflfilename  zz  "Data  2000-2011  IR  SERS.txt",  initialValues  =  {{ A,  26},  {fi^  1.8},  { ff,  .15},  {tf,  0.01 }}  ]; 
t 

IflselectedModel  zz  WeibullModel,  modelName  =  "Weibull  Model";  constraints  =  {0  <  A  <  2000,  ff  >  0,  ^  >  0,  minLogFFI  >  tf  >  0}; 
Iflfilename  zz  "Data  1983-2000  NIR  FATL.txt",  initialValues  =  {{A,  782},  {ff,  1.72},  {fi,  1.43},  {tf,  3.8}}l 
Iflfilename  ==  "Data  1983-2000  IRFATL.txt",  initialValues  =  {{A,  95}, {ff,  2.22},  {^  1.64}, {tf,  5}} I 
Iflfilename  zz"Data  1983-2000  NIR  SERS.txt",  initialValues  ={{A,  475},  {ff,  1.75},  {fi,  1.4},  {tf,  3.8}}  t 
Iflfilename  zz"Data  1983-2000  IRSERS.txt",  initialValues  =  {{A,  42}, {ff,  2.4},  {fi,  1.64}, {tf,  5.0}}]; 

Iflfilename  zz  "Data  2000-2011  NIR  FATL.txt",  initialValues  =  {{ A,  213},  {ff,  2},  {^  1.9},  {tf,  3.4}}  ]; 

Iflfilename  zz"Data  2000-2011  IRFATL.txt",  initialValues  =  {{A,  55}, {ff,  2.38},  {fi,  2.0}, {tf,  4.6}}]; 
lflfilenamezz-Data2000-2011  NIRSERS.txt",  initialValues  =  {{A,  112},{ff,  2.6},  {^  2.4},{tf,  3.1}}]; 

Iflfilename  zz  "Data  2000-2011  IR  SERS.txt",  initialValues  =  {{A,  24},  {ff,  1.55},  {fi,  1.4},  {tf,  4.9}}  ] 
t 

nim  =  NonlinearModelFitlbinnedLnData,  {selectedModel,  constraints},  initialValues,  x,  AccuracyGoal 4,  PrecisionGoal 4, 
Maxiterations  -^200,  Method  -^"Automatic"  11:4^  Gradient -^"FiniteDifference"]  H 


lf| selectedModel  zz  gammaModel ,  A  =  nimlll,  2, 1, 2]t  ff  =  nimlll,  2, 2, 2]t  fi  =  nimlll,  2, 3, 2]t  tf  =  nimlll,  2, 4, 2]l 
(Loglx  ]  -  tf)'^-^  E 

linearModellx_]:=  A - ;  (+Tbis  is  tbe  linear  version,  using  parameters  est'd  in  tbe  log  domain  +) 

Gammalff] 

minf  =  SolvellinearModellx]zzO,  xHI,  1, 2]l 

pPDF  =  FindArgMinlAbsl  (NlntegratellinearModellxH^,  minf,  mu}]/(n  ^^binSize]]  -0.5Hmu,  Medianldata]}ll1]t 
Printl"A-^ ",  A, "  of=  ",  ff ,  "  fi  =  ",  fi,"  6  =  ",  tf, "  pPDF  =  ",  pPDF, "  n  =  ",  nj  n^iinSize  =  tbe  area  under  tbe  curve  +) 

IflselectedModel  ==LogNornialModel,  A  =  nlni[[1, 2, 1, 2]t  (i  =  nlni[[1, 2, 2, 2]t  <r  =  nlm[[1, 2, 3, 2]];  6  =  nlni[[1, 2, 4, 2]] ; 

(-^+Log[LQg[j!l-<!l|^ 

Re[^-  I 

linearModellx_]:= - [LoqI-^h]  r - _ 

V2ff 


(^^Tbe  numerical  estimate  of  tbe  same  definite  integral.  Throws  an  error  msg,  but  works 
pPDF  =  FindArgMinlAbsl  (NlntegratellinearModellxU^,  E^tf,  mu}]/(n  :4<binSize]]  -0.5Umu,  Medianldata]}ll1]] ; 
Printr'A^  ",  A, "  fi=  ",  "  ff  =  ",  ff, "  tf  =  ",  tf,  "  pPDF  =  ",  pPDF, "  n  =  ",  n] 

t 

lf[ selectedModel  ==  WeibullModel,  A  =nlm[[1,2, 1,2]t  <r  =  nlni[[1,2,2,2]t  ^  =  nlni[[1,2,3,2]t  f  =  nlm[[1, 2, 4, 2]t 

J  Log[^]^  jp: 

linearModellx_]:=  Re[A ^  ^  ff^"^ (Lo9l>f]-tf)  J 

{^^The  X- value  of  the  log  prediction  fen,  above  &  below  which  =  50%  area  under  the  curve  4^} 
pPDF  =  FindArgMinlAbsl  (NlntegratellinearModellxU^i  E^tf,  mu}]/(n  :4<binSize]]  -0.5Umu,  Medianldata]}ll1]]; 
Printr'A^  ",  A, "  ff=  ",  ff ,  "  fi  =  ",  ^  "  tf  =  ",  tf, "  pPDF  =  ",  pPDF, "  n  =  ",  n] 

Blockllx},  Forli  =  1,  i  <  LengthlbinnedLnDatal  i++,  AppendTolactualYs,  binnedLnDatalli,  2]]t  x  =  binnedLnDatalli,  1]t 
AppendTolpredictedYs,  selectedModel  ]]t  (^End  Block  4^] 
maxActualYs  =  MaxlactualYst 

gridlines  ={};  Forli  =  0,  i  <  CeilinglmaxActualYs,  101  i  +  =  (CeilinglmaxActualYs,  10]/ 10],  AppendTolgridlines,  \]l 
r  =  CorrelationlactualYs,  predictedYst 


C2 


,  1  +  Abs[r] , 

Z  =  0.5  Log  - b  (+  Fisher's  Z -transform  of  the  correlation  r.  See  Glass  &  Flopkins,  p.  304-307  +) 

^  1  -  Ahs[r]  ^ 

=  N[1  /Sqrt[n  -  3]  ];  (+  Standard  error  of  Fisher's  Z  +) 

Tupper  Cl  =  Tanh[Z  + 1.96  I  (^flyperholic  tangent.  See  Glass  &  Flopkins,  p.  305.10  +) 
i^iower  Cl  —  Tanh[Z  — 1.96  I 

Print["r  =  ",  r, "  Upper  .95CI  =  ",  rupper  ci,  "  Lower  .95CI  =  ",  How  ci, "  r^  =  ",  r^2  J 

Print["MMCA  R  squared:  ",  nlm["RSquared"l "  MMCA  Adj  R  squared:  ",  nlm["AdjustedRSquared"]t 
Print["MMCA  ParameterConfidenceIntervalTahle: ",  nlm["ParameterConfidencelntervalTahle",  ConfidenceLevel  -^.95]t 
(4Print["MMC  A  CovarianceMatrix:  ",T  ahleForm[nlm["CovarianceMatrix"  ]]l  +) 

(^rint["MMC A  ANOVATahle:  ",hd =nlm[" ANOVATahle"  I"  MSE:  ",nlm[" ANOVATahleMeanSquares"  ]l 
linearXRange  =  3000; 

linear  AxesTicks  ={50, 150,  250,  350,  450,  550,  650,  750,  850,  950, 1050, 1550,  2050,  2550}; 

g1  =  ListPlot[hinnedLnData,  PlotRange  All,  Filling  Axis,  FillingStyle  -^Redt  (^^These  already  have  a  log -scaled  x 

CI95[x_]  =  Tahle[nlm["MeanPredictionBands",  ConfidenceLevel  -^.95]  J 

g2  =  Plot[{  nlm[x],  CI95[x]},{x,  tf,  maxLogFH},  Filling  ^{1  ^  {{2},  Yellow}},  PlotRange  ^  All,  PlotStyle  ^{Thickness[.002], 
RGBColor[0,  0, 1]}  t 

g3  =  Graphics[{Thickness[.002l  Dashing[.0H  Line[{{pPDF,  0},{pPDF,  maxActualYs}}]}^ 

g5  =  ListPlot[hinnedData,  PlotRange  -^{{0,  linearXRange},  All},  Filling  Axis,  FillingStyle  -^Redt  {^^These  have  a  log -scaled  x 
g6  =  Plot[{linearModel[x],  CI95[Log[x]]},{x,  E^tf,  linearXRange},  Filling  ->{1  ->  {{2},  Yellow}},  PlotRange  All, 

PlotStyle  ^{Thickness[.002]}t 

theLahel  =  StringJoin[modelName, ",  ",  filename, ",  n^ccidents  =  ToString[nl ",  nbins  =  ToString[nBinsl 
"  .50  Quantile  (pPDF)  =  ",  ToString[fiPDF]] 

Show[g1,  g2,  ImageSize  -^{800,  400},  PlotRange  maxLogFFI},  All},  AspectRatio  -^Full,  AxesLahel  -^{"Log[TFFir,  "Count"}, 
GridLines  -^{None,  gridlines},  AxesOrigin  0}] 

Show[g5,  g6,  g3,  ImageSize  -^{800, 400},  PlotRange  -^{{0,  linearXRange},  All},  AspectRatio  -^Full,  AxesLahel  -^{"TFFI",  "Count"}, 
GridLines  -^{None,  gridlines},  AxesOrigin  -^{0, 0},  Ticks  -^{linear AxesTicks,  Automatic}] 

Nlntegrate[linearModel[xHKi  pPDF}]/(n  ^^hinSize)  Double -check  calculation  of  the  median  area  under  the  curve 

This  is  the  method  of  finding  quantiles  4^) 
qn=0.9;  (+0<qn<1 +) 

FindArgMin[Ahs[  (Nlntegrate[linearModel[xH  x,  mint,  mu}]  /  (n  ^^binSize]]  -  qnH^u,  Quantile[data,  qn]}l[1  ]] 
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